Fault-Tolerance for High-Performance Multi-Module VLSI Systems Using Micro Rollback

Authors

  • Marc Tremblay
  • Yuval Tamir
Abstract

To achieve fault tolerance, highly reliable systems often require hardware-supported concurrent error detection for all system components. Checkers are connected in the communication paths from each module to the rest of the system, reducing system performance by requiring either longer clock cycles or additional pipeline stages. The performance penalty of concurrent error detection can be minimized by performing the checks in parallel with the transmission of information between modules, thus removing the detection delay from the critical path. Erroneous information may therefore reach a module several clock cycles before the corresponding error indication. Operations based on this information are "undone" using micro rollback, a hardware mechanism for rapid rollback of a few cycles. In this paper we show how micro rollback can be efficiently implemented in complex VLSI systems consisting of multiple modules which interact asynchronously. The implementation of key building blocks in CMOS VLSI is described and evaluated.
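
The idea of undoing a few cycles of state changes can be illustrated in software. The following is a minimal sketch only, not the hardware design described in the paper: the class `RollbackRegister`, its `depth` parameter, and the methods `clock` and `rollback` are hypothetical names chosen for this illustration. It models a single register that checkpoints its value every cycle so that a late error indication can restore the state from a few cycles earlier.

```python
from collections import deque


class RollbackRegister:
    """Illustrative model (an assumption, not the paper's circuit):
    a register that can undo up to `depth` recent clock cycles."""

    def __init__(self, initial=0, depth=4):
        self.value = initial
        self.depth = depth
        # Values the register held just before each of the last `depth` cycles.
        self.history = deque(maxlen=depth)

    def clock(self, new_value=None):
        """Advance one cycle, optionally loading a new value."""
        self.history.append(self.value)  # checkpoint the pre-cycle state
        if new_value is not None:
            self.value = new_value

    def rollback(self, cycles):
        """Restore the value held before the last `cycles` cycles."""
        if not 0 < cycles <= len(self.history):
            raise ValueError("rollback distance exceeds stored history")
        for _ in range(cycles):
            self.value = self.history.pop()


if __name__ == "__main__":
    reg = RollbackRegister(initial=10, depth=3)
    reg.clock(11)    # cycle 1: correct data
    reg.clock(12)    # cycle 2: erroneous data arrives, not yet flagged
    reg.clock(13)    # cycle 3: computed from the erroneous data
    reg.rollback(2)  # error indication arrives two cycles late
    print(reg.value)
```

In this sketch the history depth bounds the detection latency that can be tolerated: an error reported more than `depth` cycles late cannot be undone, which is why `rollback` rejects larger distances.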

Similar resources

The implementation and application of micro rollback in fault-tolerant VLSI systems

Concurrent error detection and confinement requires checking the outputs of every module in the system during every clock cycle. This is usually accomplished by checkers and isolation circuits in the communication paths from each module to the rest of the system. This additional circuitry reduces system performance. We present a technique, called micro rollback, which allows most of the perform...

Support for Fault Tolerance in VLSI Processors

Fault tolerance techniques are used to allow computer systems to continue correct operation despite component failure. Hardware-supported concurrent error-detection and limited fault tolerance in system components, as implemented by coding or replication, are often required. Detection latency can be reduced by increasing the visibility of internal module state using compressed "signatures" of...

Control-flow checking using watchdog assists and extended-precision checksums

troller), it has the advantage of maintaining sequential consistency, thus allowing parallel programs to work as expected. VII. SUMMARY AND CONCLUSIONS The delays due to error checking being performed in series with intermodule communication are one of the primary causes of performance degradation associated with implementing concurrent error detection and correction in VLSI systems. This perfo...

Applying Fault-Tolerance Principles to Security Research

We have been conducting research in reliable distributed systems for the last twenty years. We have worked on the development of concepts such as consistency, atomicity, durability, availability, rollback, checkpoints, and adaptability [1,2]. The IEEE Symposium on Reliable Distributed Systems, held every year, includes many papers dealing with high availability, dependability, and non-stop op...

Fault-Tolerant Execution on COTS Multi-core Processors with Hardware Transactional Memory Support

The demand for fault-tolerant execution on high-performance computer systems increases due to higher fault rates resulting from smaller structure sizes. As an alternative to hardware-based lockstep solutions, software-based fault-tolerance mechanisms can increase the reliability of multi-core commercial-off-the-shelf (COTS) CPUs while being cheaper and more flexible. This paper proposes a softwa...

Publication date: 1989